87 research outputs found

    Dynamically tunable memory hierarchy

    Get PDF
    Journal Article
    The widespread use of repeaters in long wires creates the possibility of dynamically sizing regular on-chip structures. We present a tunable cache and translation lookaside buffer (TLB) hierarchy that leverages repeater insertion to dynamically trade off size for speed and power consumption on a per-application phase basis using a novel configuration management algorithm. In comparison to a conventional design that is fixed at a single design point targeted to the average application, the dynamically tunable cache and TLB hierarchy can be tailored to the needs of each application phase. The configuration algorithm dynamically detects phase changes and selects a configuration based on the application's ability to tolerate different hit and miss latencies in order to improve the memory energy-delay product. We evaluate the performance and energy consumption of our approach and project the effects of technology scaling trends on our design.
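
    The configuration management algorithm itself is only summarized above; as a rough illustration of the general idea (interval-based phase detection followed by choosing the configuration with the best estimated energy-delay product), a minimal Python sketch might look like the following. The interval length, the phase-change test, and the EDP model are assumptions for illustration, not the paper's actual algorithm.

        # Hypothetical sketch of interval-based phase detection and cache/TLB
        # configuration selection driven by an energy-delay-product estimate.

        INTERVAL_INSTRUCTIONS = 100_000        # assumed sampling interval
        PHASE_CHANGE_THRESHOLD = 0.10          # assumed relative change in CPI

        def detect_phase_change(prev_cpi, cur_cpi):
            """Flag a phase change when CPI moves by more than the threshold."""
            if prev_cpi is None:
                return True
            return abs(cur_cpi - prev_cpi) / prev_cpi > PHASE_CHANGE_THRESHOLD

        def estimate_edp(config, stats):
            """Toy energy-delay-product model; both terms are assumptions."""
            delay = (stats["instructions"] * config["hit_latency"]
                     + stats["misses"] * config["miss_penalty"])
            energy = stats["accesses"] * config["energy_per_access"]
            return energy * delay

        def select_configuration(configs, stats):
            """Pick the candidate with the lowest estimated EDP for this phase."""
            return min(configs, key=lambda c: estimate_edp(c, stats))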

    Dynamic Data Dependence Tracking and Its Application to Branch Prediction

    Get PDF
    To continue to improve processor performance, microarchitects seek to increase the effective instruction level parallelism (ILP) that can be exploited in applications. A fundamental limit to improving ILP is data dependences among instructions. If data dependence information is available at run-time, it can be used in many ways to improve ILP. Prior published examples include decoupled branch execution architectures and critical instruction detection. In this paper, we describe an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline. This information is available on a cycle-by-cycle basis to the microengine for optimizing its performance. We then use this design in a new value-based branch prediction design using Available Register Value Information (ARVI). From the use of data dependence information, the ARVI branch predictor has better prediction accuracy than a comparably sized hybrid branch predictor. With ARVI used as the second-level branch predictor, the improved prediction accuracy results in a 12.6% performance improvement on average across the SPEC95 integer benchmark suite.
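
    The hardware details of the tracking mechanism are not given in the abstract; the Python sketch below only illustrates the basic bookkeeping such a tracker might perform, assigning each dispatched instruction a dependence-chain depth one greater than the deepest chain among its source registers. The register-indexed table and its size are assumptions.

        # Hypothetical sketch of per-register dependence-chain tracking.
        # Each register remembers the chain depth of the instruction that
        # produced its current value; a new instruction's depth is one more
        # than the deepest of its sources.

        class DependenceTracker:
            def __init__(self, num_regs=32):
                self.reg_depth = [0] * num_regs   # chain depth per register

            def dispatch(self, srcs, dest=None):
                """Record and return the chain depth of a dispatched instruction."""
                depth = 1 + max((self.reg_depth[r] for r in srcs), default=0)
                if dest is not None:
                    self.reg_depth[dest] = depth
                return depth

        # Example: r3 = r1 + r2, then r4 = r3 * r3 (a two-deep chain)
        tracker = DependenceTracker()
        print(tracker.dispatch(srcs=[1, 2], dest=3))   # depth 1
        print(tracker.dispatch(srcs=[3, 3], dest=4))   # depth 2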

    Dynamically Matching ILP Characteristics Via a Heterogeneous Clustered Microarchitecture

    Get PDF
    Applications vary in the degree of instruction level parallelism (ILP) available to be exploited by a superscalar processor. The ILP can also vary significantly within an application. On one end of the microarchitecture space are monolithic superscalar designs that exploit parallelism within an application. At the other end of the spectrum are clustered architectures having many simple cores that can be clocked at a higher frequency than a comparable monolithic design. A disadvantage of the clustered design is the cost of transmitting results between clusters, which potentially limits performance even in high-ILP applications if instructions are not mapped carefully to minimize cross-cluster communication. In this paper, we propose an approach that incorporates the strengths of both designs by clustering multiple integer ALUs in an asymmetric fashion. In one cluster, a 6-way out-of-order pipeline efficiently executes code having high ILP. In another cluster, a simpler, but deeper, 2-way pipeline running at twice the frequency speeds up regions of code having little ILP. We use parallelism metrics gathered from the dynamic Data Dependence Tracking (DDT) mechanism to dynamically steer instruction windows to a cluster. We demonstrate that the heterogeneous cluster design can improve performance by up to 30% over a monolithic 8-way superscalar processor.
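
    How the DDT metrics map to a steering decision is not spelled out above; the sketch below merely illustrates the kind of rule such a design might use, sending windows with long dependence chains (little ILP) to the fast narrow cluster and windows with ample independent work to the 6-way cluster. The metric and threshold are assumptions.

        # Hypothetical steering rule: compare a DDT-style parallelism estimate
        # (instructions per level of the critical dependence chain) against a
        # threshold and pick a cluster. The threshold value is an assumption.

        ILP_THRESHOLD = 2.0

        def steer_window(window_size, critical_chain_length):
            """Return which cluster should execute this instruction window."""
            avg_parallelism = window_size / max(critical_chain_length, 1)
            if avg_parallelism >= ILP_THRESHOLD:
                return "wide 6-way cluster"      # plenty of independent work
            return "fast 2-way cluster"          # serial code gains from the 2x clock

        print(steer_window(window_size=32, critical_chain_length=8))    # high ILP
        print(steer_window(window_size=32, critical_chain_length=24))   # low ILP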

    A High Performance, Energy Efficient GALS Processor Microarchitecture with Reduced Implementation Complexity

    Full text link
    As the costs and challenges of global clock distribution grow with each new microprocessor generation, a Globally Asynchronous, Locally Synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, called a Multiple Clock Domain (MCD) processor, achieves impressive energy savings for a relatively low performance cost. However, the approach requires separating the processor into four domains, including separating the integer and memory domains (which complicates load scheduling), and implementing 32 voltage and frequency levels in each domain. In addition, the hardware-based control algorithm, though effective overall, produces a significant performance degradation for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing the implementation complexity. We first determine that the synchronization channels most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the Reorder Buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit standard power optimizations to be used, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement.
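
    The new online algorithm is not described in the abstract; controllers in the MCD literature commonly scale a domain's frequency with the occupancy of its instruction queues, and the sketch below only illustrates that general style of interval-based controller. The thresholds, step size, and frequency range are assumptions.

        # Hypothetical interval-based controller: raise a domain's frequency when
        # its issue queue is filling up, lower it when the queue is mostly empty.
        # All thresholds and the step size are illustrative assumptions.

        HIGH_OCCUPANCY = 0.8        # queue fraction above which we speed up
        LOW_OCCUPANCY = 0.3         # queue fraction below which we slow down
        FREQ_STEP_MHZ = 50
        FREQ_MIN_MHZ, FREQ_MAX_MHZ = 250, 1000

        def adjust_domain_frequency(freq_mhz, avg_queue_occupancy):
            """Return the domain frequency to use for the next interval."""
            if avg_queue_occupancy > HIGH_OCCUPANCY:
                freq_mhz += FREQ_STEP_MHZ
            elif avg_queue_occupancy < LOW_OCCUPANCY:
                freq_mhz -= FREQ_STEP_MHZ
            return max(FREQ_MIN_MHZ, min(FREQ_MAX_MHZ, freq_mhz))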

    Improving Application Performance by Dynamically Balancing Speed and Complexity in a GALS Microprocessor

    Get PDF
    Microprocessors are traditionally designed to provide “best overall” performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by “downsizing” hardware resources that are underutilized by particular applications. We explore the converse: “upsizing” hardware resources in order to improve performance relative to an aggressively clocked baseline processor. Our proposal depends critically on the ability to change frequencies independently in separate domains of a globally asynchronous, locally synchronous (GALS) microprocessor. We use a variant of our multiple clock domain (MCD) processor, with four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Measuring across a broad suite of application benchmarks, we find that configuring just once per application increases performance by an average of 17.6% compared to the best fully synchronous design. When adapting to application phases, performance improves by over 20%.
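
    Whether upsizing a structure pays off comes down to whether the IPC gain outweighs the accompanying frequency loss in that domain; the small sketch below makes the trade-off explicit. The IPC and frequency numbers are invented for illustration, not measurements from the paper.

        # Hypothetical comparison of two domain configurations: throughput is
        # IPC * frequency, so upsizing wins only when the IPC gain exceeds the
        # fractional frequency loss. All numbers are illustrative.

        def throughput(ipc, freq_ghz):
            return ipc * freq_ghz    # instructions per nanosecond

        small_fast = throughput(ipc=1.20, freq_ghz=2.0)   # streamlined structure
        big_slow = throughput(ipc=1.45, freq_ghz=1.8)     # upsized, slower clock

        print(f"small/fast: {small_fast:.2f}  upsized/slower: {big_slow:.2f}")
        # The upsized configuration wins here (2.61 vs 2.40) because a ~21% IPC
        # gain outweighs a 10% frequency loss.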

    Profile-based Dynamic Voltage and Frequency Scaling for a Multiple Clock Domain Microprocessor

    Get PDF
    A Multiple Clock Domain (MCD) processor addresses the challenges of clock distribution and power dissipation by dividing a chip into several (coarse-grained) clock domains, allowing frequency and voltage to be reduced in domains that are not currently on the application’s critical path. Given a reconfiguration mechanism capable of choosing appropriate times and values for voltage/frequency scaling, an MCD processor has the potential to achieve significant energy savings with low performance degradation. Early work on MCD processors evaluated the potential for energy savings by manually inserting reconfiguration instructions into applications, or by employing an oracle driven by off-line analysis of (identical) prior program runs. Subsequent work developed a hardware-based on-line mechanism that averages 75–85% of the energy-delay improvement achieved via off-line analysis. In this paper we consider the automatic insertion of reconfiguration instructions into applications, using profile-driven binary rewriting. Profile-based reconfiguration introduces the need for “training runs” prior to production use of a given application, but avoids the hardware complexity of on-line reconfiguration. It also has the potential to yield significantly greater energy savings. Experimental results (training on small data sets and then running on larger, alternative data sets) indicate that the profile-driven approach is more stable than hardware-based reconfiguration, and yields virtually all of the energy-delay improvement achieved via off-line analysis.
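
    The binary-rewriting flow is not detailed in the abstract; the sketch below illustrates only the offline half of such a scheme: using per-region profile data to pick the lowest domain frequency whose predicted slowdown stays within a degradation budget. The slowdown model, the budget, and the frequency levels are assumptions.

        # Hypothetical offline pass over profiled program regions. For each region,
        # choose the lowest frequency whose predicted slowdown fits the budget.
        # The linear slowdown model and all constants are assumptions.

        DEGRADATION_BUDGET = 0.02              # allow up to 2% predicted slowdown
        FREQ_LEVELS_MHZ = [1000, 800, 600, 400]

        def predicted_slowdown(busy_fraction, freq_mhz, base_mhz=1000):
            """Naive model: only the busy fraction of time stretches with 1/f."""
            return busy_fraction * (base_mhz / freq_mhz - 1.0)

        def pick_frequency(busy_fraction):
            for f in sorted(FREQ_LEVELS_MHZ):          # try the slowest level first
                if predicted_slowdown(busy_fraction, f) <= DEGRADATION_BUDGET:
                    return f
            return max(FREQ_LEVELS_MHZ)

        print(pick_frequency(busy_fraction=0.01))   # mostly stalled region -> 400 MHz
        print(pick_frequency(busy_fraction=0.90))   # busy region -> 1000 MHz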

    Dynamically Trading Frequency for Complexity in a GALS Microprocessor

    Get PDF
    Microprocessors are traditionally designed to provide “best overall” performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by “downsizing” hardware resources that are underutilized by the current application phase. Others have proposed a different energy-saving approach: dividing the processor into domains and dynamically changing the clock frequency and voltage within each domain during phases when the full domain frequency is not required. What has not been studied to date is how to exploit the adaptive nature of these approaches to improve performance rather than to save energy. In this paper, we describe an adaptive globally asynchronous, locally synchronous (GALS) microprocessor with a fixed global voltage and four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Our approach, therefore, is to maximize the throughput of each domain by finding the proper balance between the number of clock periods and the clock frequency for each application phase. To achieve this objective, we use novel hardware-based control techniques that accurately and efficiently capture the performance of all possible cache and queue configurations within a single interval, without having to resort to exhaustive online exploration or expensive offline profiling. Measuring across a broad suite of application benchmarks, we find that configuring our adaptive GALS processor just once per application yields 17.6% better performance, on average, than that of the “best overall” fully synchronous design. By adapting automatically to application phases, we can increase this advantage to more than 20%.
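
    The control technique that captures the behavior of every cache and queue configuration within a single interval is the contribution of the paper and is not reproduced here; the sketch below shows only the surrounding decision step, choosing the configuration whose estimated execution time for the interval is lowest. The per-configuration cycle estimates are assumed inputs, standing in for the hardware monitoring described above.

        # Hypothetical per-phase decision step: given estimated cycle counts for the
        # interval under each candidate (configuration, frequency) pair, pick the
        # pair with the lowest estimated time. The estimates are assumed inputs.

        def best_configuration(candidates):
            """candidates: list of (name, estimated_cycles, freq_ghz) tuples."""
            return min(candidates, key=lambda c: c[1] / c[2])   # time = cycles / f

        candidates = [
            ("small queues @ 2.0 GHz", 1_000_000, 2.0),
            ("large queues @ 1.6 GHz", 780_000, 1.6),
        ]
        print(best_configuration(candidates)[0])   # the large-queue point wins here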

    Managing Static Leakage Energy in Microprocessor Functional Units

    Get PDF
    Static energy due to subthreshold leakage current is projected to become a major component of the total energy in high performance microprocessors. Many studies so far have examined and proposed techniques to reduce leakage in on-chip storage structures. In this study, static energy is reduced in the integer functional units by leveraging the unique qualities of dual threshold voltage domino logic. Domino logic has desirable properties that greatly reduce leakage current while providing fast propagation times. However, due to the energy cost of entering the low leakage current state (sleep mode), domino logic has thus far been used for leakage reduction only in long-term standby mode. We examine the utility of the sleep mode (while considering the aforementioned costs) when idle times are relatively short, one to a few hundred cycles, as is often the case for functional units. Using an analytical energy model suitable for architecture-level analysis, we explore the interaction of the application and technology, and the effect on energy and performance as the underlying parameters are varied, on a set of benchmarks. Our results show that if leakage approaches the magnitude projected in the literature, then even for idle intervals as short as ten cycles, an aggressive policy of activating the sleep mode at every idle period performs well, and a more complex control strategy may not be warranted. We also propose a simple design, called Gradual Sleep, to reduce the energy impact of using the sleep mode for smaller idle periods.
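
    The analytical model itself is not given in the abstract; the sketch below captures only the underlying break-even reasoning: entering sleep pays off when the leakage energy saved over the idle interval exceeds the one-time cost of entering and leaving the sleep state. All energy constants are invented for illustration.

        # Hypothetical break-even test for putting a functional unit to sleep.
        # Sleep is worthwhile when the leakage saved over the idle interval
        # exceeds the transition cost. Per-cycle energies are assumptions.

        LEAKAGE_PER_CYCLE_ACTIVE = 1.0    # arbitrary energy units
        LEAKAGE_PER_CYCLE_SLEEP = 0.05
        SLEEP_TRANSITION_COST = 6.0       # energy to enter plus exit sleep

        def should_sleep(idle_cycles):
            saved = (LEAKAGE_PER_CYCLE_ACTIVE - LEAKAGE_PER_CYCLE_SLEEP) * idle_cycles
            return saved > SLEEP_TRANSITION_COST

        for n in (4, 10, 100):
            print(n, should_sleep(n))   # break-even is just over 6 idle cycles here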

    Dynamic Frequency and Voltage Scaling for a Multiple-Clock-Domain Microprocessor

    Get PDF
    Multiple clock domains are one solution to the growing problem of propagating the clock signal across ever larger and faster chips. The ability to independently scale frequency and voltage in each domain creates a powerful means of reducing power dissipation.
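
    The benefit follows from the usual CMOS switching-power relation P ≈ C·V²·f (with the activity factor folded into the effective capacitance): reducing frequency alone saves power only linearly, but reducing voltage along with it gives a roughly cubic reduction. A small worked example with invented nominal values:

        # Worked example of P = C * V^2 * f for one clock domain.
        # The capacitance, voltage, and frequency values are illustrative only.

        def dynamic_power(c_eff_nf, v_volts, f_mhz):
            return c_eff_nf * v_volts ** 2 * f_mhz   # arbitrary consistent units

        nominal = dynamic_power(1.0, 1.2, 1000)      # domain at full speed
        scaled = dynamic_power(1.0, 0.9, 700)        # domain off the critical path

        print(f"power reduced to {scaled / nominal:.0%} of nominal")   # ~39%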

    Hiding Synchronization Delays in a GALS Processor Microarchitecture

    Get PDF
    We analyze an Alpha 21264-like Globally Asynchronous, Locally Synchronous (GALS) processor organized as a Multiple Clock Domain (MCD) microarchitecture and identify the architectural features of the processor that influence the limited performance degradation measured. We show that the out-of-order superscalar execution features of a processor, which allow traditional instruction execution latency to be hidden, are the same features that reduce the performance impact of the synchronization costs of an MCD processor. In the case of our Alpha 21264-like processor, up to 94% of the MCD synchronization delays are hidden and do not impact overall performance. In addition, we show that by adding out-of-order superscalar execution capabilities to a simpler microarchitecture, such as an Intel StrongARM-like processor, as much as 62% of the performance degradation caused by synchronization delays can be eliminated.